If you haven’t already, clone the github repo: git clone git@github.com:morinlab/GAMBLR.git Run this vignette using the base directory of the GAMBLR repository as your working directory. See the Readme in the repository for more tips on getting set up.
# Load the GAMBLR package and other packages required by these examples.
library(GAMBLR)
require(tidyverse)
require(maftools)
require(circlize)
require(data.table)
require(rtracklayer)
require(RMariaDB)
require(DBI)The key level-1 GAMBL outputs are making their way into a MySQL (a.k.a. MariaDB) database. The most intuitive way to query the data is using the dbplyr R package. The initial connection and a method to determine what each table contains is shown below.
con <- DBI::dbConnect(RMariaDB::MariaDB(), dbname = "gambl_test")
# use DBI function to list the tables
all_table_names = DBI::dbListTables(con)
print(all_table_names)
#> [1] "analysis" "bedpe_manta_hg19" "biopsy_metadata"
#> [4] "derived_data" "maf_slms3_hg19" "maf_slms3_hg19_icgc"
#> [7] "maf_slms3_hg38" "outcome_metadata" "sample_metadata"
#> [10] "seg_battenberg_hg19"
# peek at the contents of each table.
for(table_name in all_table_names){
table_db <- tbl(con, table_name)
print(table_name)
}
#> [1] "analysis"
#> [1] "bedpe_manta_hg19"
#> [1] "biopsy_metadata"
#> [1] "derived_data"
#> [1] "maf_slms3_hg19"
#> [1] "maf_slms3_hg19_icgc"
#> [1] "maf_slms3_hg38"
#> [1] "outcome_metadata"
#> [1] "sample_metadata"
#> [1] "seg_battenberg_hg19"Get the metadata, automagically joining both tables and dropping normals and RNA-seq samples.
my_metadata = get_gambl_metadata()
# reduce to some of the more useful metadata fields
my_metadata = my_metadata %>% select(sample_id,biopsy_id,myc_ba,cohort,pathology)
print(head(my_metadata))
#> sample_id biopsy_id myc_ba cohort pathology
#> 1 BLGSP-71-06-00001-01A-11D BLGSP-71-06-00001-01 POS BL_Pediatric BL
#> 2 BLGSP-71-06-00004-01A-11D BLGSP-71-06-00004-01 POS BL_Pediatric BL
#> 3 BLGSP-71-06-00013-01B-11D BLGSP-71-06-00013-01 POS BL_Pediatric BL
#> 4 BLGSP-71-06-00080-01A-11D BLGSP-71-06-00080-01 POS BL_Pediatric BL
#> 5 BLGSP-71-06-00019-01A-11D BLGSP-71-06-00019-01 POS BL_Pediatric BL
#> 6 BLGSP-71-06-00008-01A-11D BLGSP-71-06-00008-01 NEG BL_Pediatric BL
#An important feature for reproducibility is that we all use the same exact subset of samples when performing various level-3 analyses for a study. This function effectively "locks in" a set of cases for a study based on some filters it applies automatically based on a study set identifier. For retrieving the metadata for all BLGSP cases with WGS data, you can use the following:
blgsp_metadata = get_gambl_metadata(case_set="BLGSP-study")
#as you can see this spans several cohorts and doesn't even just include BL pathology (for complex reasons)
blgsp_metadata %>% pull(cohort) %>% table()
#> .
#> BL_Adult BL_cell_lines BLGSP_Bcell_UNC BL_ICGC BL_Pediatric
#> 81 23 6 17 124
#If you want the sample_id (i.e. the Tumor_Sample_Barcode) for all these cases, for example to subset a MAF file, you can extract them into a vector:
blgsp_sample_ids = pull(blgsp_metadata,sample_id)
length(blgsp_sample_ids)
#> [1] 251The largest tables in the database are those that contain mutations from MAF files. The tables are simply all the rows and columns from a given merged MAF file. There will likely be four MAF-derived tables in this database, one for each reference build for “GAMBL” and the complete “GAMBL + ICGC”. Although the names may change, currently the tables are implicitly GAMBL-only unless they contain “icgc” in their name. For example, maf_slms3_hg19_icgc contains all hg19 (grch37/hs37d5) mutations from GAMBL and icgc_dart/external cases.
hg19_maf = tbl(con,"maf_slms3_hg19_icgc")
mutation_counts = hg19_maf %>% group_by(Tumor_Sample_Barcode) %>% tally()
print(head(mutation_counts))
#> # Source: lazy query [?? x 2]
#> # Database: mysql [rmorin@mysql01:NA/gambl_test]
#> Tumor_Sample_Barcode n
#> <chr> <int64>
#> 1 00-14595_tumorA 7139
#> 2 00-14595_tumorB 18212
#> 3 00-14595_tumorC 27633
#> 4 00-15201_tumorA 6423
#> 5 00-15201_tumorB 8255
#> 6 00-20702T 6335
mutation_counts %>% ggplot() + geom_histogram(aes(x=n),bins=100) + xlim(c(10,25000))
#> Warning: Removed 142 rows containing non-finite values (stat_bin).
#> Warning: Removed 2 rows containing missing values (geom_bar).Here’s a histogram of the total number of mutations per genome across all GAMBL cases. Pretty hard to interpret this without metadata. There are two metadata tables. Often we need to join them both to get all the information we want.
sample_meta = tbl(con,"sample_metadata") %>% filter(seq_type == "genome" & tissue_status == "tumour")
#if we only care about genomes, we can drop/filter anything that isn't a tumour genome
#The key for joining this table to the mutation information is to use sample_id. Think of this as equivalent to a library_id. It will differ depending on what assay was done to the sample.
biopsy_meta = tbl(con,"biopsy_metadata") %>% select(-patient_id) %>% select(-pathology) %>% select(-time_point) %>% select(-EBV_status_inf)
# this table is keyed on biopsy_id. One biopsy can have more than one sample_id. This table must be joined to the sample table using biopsy_id. Because of some redundancy in columns between the two tables, I've dropped all redundant columns with the exception of biopsy_id.
all_meta = left_join(sample_meta,biopsy_meta,by="biopsy_id")
# IMPORTANT: the dbplyr package uses mysql queries under the hood to lazily retrieve the datat you need for each table on-the-fly. To properly use the efficiency of indexing and joins in MySQL, don't convert your tables into data frames until you're done all the necessary joins. Obtain all SV breakpoints (called by Manta) within a specific region of the genome and visualize them.
myc_locus_sv = get_manta_sv(region="8:128723128-128774067",pass=FALSE,with_chr_prefix=TRUE)
# we can override default that requires SV to meet the Manta "Pass" filtering criterion
# here we are also asking for the chromosomes to be named with a chr prefix (for circlize compatability)
bed1 = myc_locus_sv %>% select(CHROM_A,START_A,END_A,tumour_sample_id)
bed2 = myc_locus_sv %>% select(CHROM_B,START_B,END_B,tumour_sample_id)
colnames(bed1)=c("chrom","start","end","sample_id")
colnames(bed2)=c("chrom","start","end","sample_id")
myc_locus_sv = myc_locus_sv %>% select(tumour_sample_id,CHROM_A,START_A,CHROM_B,START_B,STRAND_A,STRAND_B,VAF_tumour)
print(head(myc_locus_sv))
#> tumour_sample_id CHROM_A START_A CHROM_B START_B STRAND_A STRAND_B
#> 1 00-15201_tumorB chr8 128727469 chr8 128727743 + +
#> 2 01-12047_tumorC chr8 128748814 chr14 106994498 - +
#> 3 01-14774_tumorA chr8 128727469 chr8 128727743 + +
#> 4 01-20774T chr8 66878096 chr8 128735744 + -
#> 5 01-20774T chr8 128771911 chr8 128920622 + -
#> 6 01-23117_tumorB chr8 128749370 chr14 106323836 - +
#> VAF_tumour
#> 1 0.151
#> 2 0.429
#> 3 0.163
#> 4 0.358
#> 5 0.951
#> 6 0.345
circos.initializeWithIdeogram()
circos.genomicLink(bed1, bed2,col = rand_color(nrow(bed1),transparency=0.5)) This package also includes a convenience function to facilitate annotation of common oncogene fusions/SVs in B-cell lymphomas.
unannotated_sv = get_manta_sv()
#in this example, let's just look at the SVs annotated as likely driving BCL6 expression
annotated_sv = annotate_sv(unannotated_sv,with_chr_prefix = TRUE) %>%
filter(!is.na(partner)) %>%
#select(-entrez) %>%
filter(gene=="BCL6") %>%
as.data.frame()
#> Warning in if (!grepl("chr", all.annotated$chrom1)) {: the condition has length
#> > 1 and only the first element will be used
print(head(annotated_sv))
#> chrom1 start1 end1 chrom2 start2 end2 name score strand1
#> 1 chr3 187462585 187462585 chr6 27107409 27107409 . 78 +
#> 2 chr3 187462600 187462600 chr6 27107409 27107409 . 90 -
#> 3 chr3 187462585 187462585 chr6 27107409 27107409 . 67 +
#> 4 chr3 187462600 187462600 chr6 27107409 27107409 . 93 -
#> 5 chr3 187462525 187462525 chr12 25207319 25207319 . 118 +
#> 6 chr3 187462548 187462548 chr12 25207342 25207342 . 97 -
#> strand2 tumour_sample_id gene partner fusion
#> 1 + 01-14774_tumorA BCL6 HIST1H2BK HIST1H2BK-BCL6
#> 2 - 01-14774_tumorA BCL6 HIST1H2BK HIST1H2BK-BCL6
#> 3 + 01-14774_tumorB BCL6 HIST1H2BK HIST1H2BK-BCL6
#> 4 - 01-14774_tumorB BCL6 HIST1H2BK HIST1H2BK-BCL6
#> 5 + 02-24492_tumorA BCL6 LRMP LRMP-BCL6
#> 6 - 02-24492_tumorA BCL6 LRMP LRMP-BCL6
#for labelling, get the unique set of partners
to_label =unique(annotated_sv$partner)
partner_label = grch37_partners %>%
filter(gene %in% to_label) %>% mutate(chrom = paste0("chr",chrom)) %>%
as.data.frame()
bed1 = annotated_sv %>% select(chrom1,start1,end1,tumour_sample_id,fusion)
bed2 = annotated_sv %>% select(chrom2,start2,end2,tumour_sample_id,fusion)
colnames(bed1)=c("chrom","start","end","sample_id","fusion")
colnames(bed2)=c("chrom","start","end","sample_id","fusion")
#circos.initializeWithIdeogram()
circos.clear()
circos.initializeWithIdeogram(plotType = NULL,chromosome.index = paste0("chr", c(1:22,"X")))
circos.genomicLabels(partner_label, labels.column = 4, side = "outside", cex=0.4,col='black')
circos.genomicIdeogram()
circos.genomicLink(bed1, bed2,col = rand_color(nrow(bed1),transparency=0.5))
unannotated_sv = get_manta_sv()
#Now let's try the SVs annotated as likely driving MYC expression
annotated_sv = annotate_sv(unannotated_sv,with_chr_prefix = TRUE) %>%
filter(!is.na(partner)) %>%
filter(gene=="MYC") %>%
as.data.frame()
#> Warning in if (!grepl("chr", all.annotated$chrom1)) {: the condition has length
#> > 1 and only the first element will be used
print(head(annotated_sv))
#> chrom1 start1 end1 chrom2 start2 end2 name score strand1
#> 1 chr8 134207510 134207517 chr9 36763818 36763825 . 46 +
#> 2 chr8 128748814 128748814 chr14 106994498 106994498 . 85 -
#> 3 chr8 128749370 128749370 chr14 106323836 106323836 . 122 -
#> 4 chr8 128749348 128749348 chr14 106323928 106323928 . 53 +
#> 5 chr8 128746623 128746625 chr14 106328385 106328387 . 60 -
#> 6 chr8 128922317 128922939 chr22 23114594 23114952 . 88 +
#> strand2 tumour_sample_id gene partner fusion
#> 1 - 00-26427_tumorA MYC PAX5 PAX5-MYC
#> 2 + 01-12047_tumorC MYC IGH IGH-MYC
#> 3 + 01-23117_tumorB MYC IGH IGH-MYC
#> 4 - 01-23117_tumorB MYC IGH IGH-MYC
#> 5 + 02-24492_tumorB MYC IGH IGH-MYC
#> 6 - 05-10981T MYC IGL IGL-MYC
#for labelling, get the unique set of partners
to_label =unique(annotated_sv$partner)
partner_label = grch37_partners %>%
filter(gene %in% to_label) %>% mutate(chrom = paste0("chr",chrom)) %>%
as.data.frame()
onco_label = grch37_oncogene %>% mutate(chrom = paste0("chr",chrom)) %>%
as.data.frame()
bed1 = annotated_sv %>% select(chrom1,start1,end1,tumour_sample_id,fusion)
bed2 = annotated_sv %>% select(chrom2,start2,end2,tumour_sample_id,fusion)
colnames(bed1)=c("chrom","start","end","sample_id","fusion")
colnames(bed2)=c("chrom","start","end","sample_id","fusion")
circos.initializeWithIdeogram(plotType = NULL,chromosome.index = paste0("chr", c(1:22,"X")))
#circos.genomicLabels(partner_label, labels.column = 4, side = "outside", cex=0.4,col='black')
#circos.genomicLabels(onco_label, labels.column = 4, side = "outside", cex=0.4,col='red')
all_labels = rbind(onco_label,partner_label)
cols= c(rep("red",length(onco_label$chrom)),rep("black",length(partner_label$chrom)))
circos.genomicLabels(all_labels, labels.column = 4, side = "outside", cex=0.7,col=cols)
circos.genomicIdeogram()
circos.genomicLink(bed1, bed2,col = rand_color(nrow(bed1),transparency=0.5))This package has convenience features such as the ability to run liftOver on a set of coordinates (e.g. in bedpe format)
bedpe_hg19 = unannotated_sv %>% head(20) %>% as.data.frame()
print(head(bedpe_hg19))
#> CHROM_A START_A END_A CHROM_B START_B END_B NAME SOMATIC_SCORE
#> 1 1 1556541 1556547 1 1556664 1556670 . 40
#> 2 1 6012725 6012732 1 6012825 6012832 . 48
#> 3 1 8464072 8464090 1 8464293 8464311 . 40
#> 4 1 10084099 10084251 1 10084266 10084411 . 48
#> 5 1 10526162 10526172 1 10526290 10526300 . 40
#> 6 1 15878430 15878436 1 15878608 15878614 . 40
#> STRAND_A STRAND_B TYPE FILTER VAF_tumour VAF_normal DP_tumour DP_normal
#> 1 - - BND PASS 0.145 0 55 73
#> 2 + + BND PASS 0.156 0 32 85
#> 3 + + BND PASS 0.112 0 215 84
#> 4 - - BND PASS 0.186 0 43 50
#> 5 - - BND PASS 0.157 0 121 92
#> 6 + + BND PASS 0.119 0 126 101
#> tumour_sample_id normal_sample_id pair_status
#> 1 00-14595_tumorA 00-14595_normal matched
#> 2 00-14595_tumorA 00-14595_normal matched
#> 3 00-14595_tumorA 00-14595_normal matched
#> 4 00-14595_tumorA 00-14595_normal matched
#> 5 00-14595_tumorA 00-14595_normal matched
#> 6 00-14595_tumorA 00-14595_normal matched
bedpe_hg38 = liftover_bedpe(bedpe_df = bedpe_hg19,target_build = "hg38")
#> Warning in if (!grepl("chr", original_bedpe$CHROM_A)) {: the condition has
#> length > 1 and only the first element will be used
#> Warning: 20 parsing failures.
#> row col expected actual file
#> 1 -- 6 columns 12 columns literal data
#> 2 -- 6 columns 12 columns literal data
#> 3 -- 6 columns 12 columns literal data
#> 4 -- 6 columns 12 columns literal data
#> 5 -- 6 columns 12 columns literal data
#> ... ... ......... .......... ............
#> See problems(...) for more details.
#> Warning: 20 parsing failures.
#> row col expected actual file
#> 1 -- 6 columns 12 columns literal data
#> 2 -- 6 columns 12 columns literal data
#> 3 -- 6 columns 12 columns literal data
#> 4 -- 6 columns 12 columns literal data
#> 5 -- 6 columns 12 columns literal data
#> ... ... ......... .......... ............
#> See problems(...) for more details.
print(head(bedpe_hg38))
#> # A tibble: 6 x 19
#> CHROM_A START_A END_A CHROM_B START_B END_B NAME SOMATIC_SCORE STRAND_A
#> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <dbl> <chr>
#> 1 chr1 1.62e6 1.62e6 chr1 1.62e6 1.62e6 . 40 -
#> 2 chr1 5.95e6 5.95e6 chr1 5.95e6 5.95e6 . 48 +
#> 3 chr1 8.40e6 8.40e6 chr1 8.40e6 8.40e6 . 40 +
#> 4 chr1 1.00e7 1.00e7 chr1 1.00e7 1.00e7 . 48 -
#> 5 chr1 1.05e7 1.05e7 chr1 1.05e7 1.05e7 . 40 -
#> 6 chr1 1.56e7 1.56e7 chr1 1.56e7 1.56e7 . 40 +
#> # … with 10 more variables: STRAND_B <chr>, TYPE <chr>, FILTER <chr>,
#> # VAF_tumour <dbl>, VAF_normal <dbl>, DP_tumour <int>, DP_normal <int>,
#> # tumour_sample_id <chr>, normal_sample_id <chr>, pair_status <chr>Working with all mutations from a sample.
# we can also directly query the database to get a MAF per patient for patient-centric visualizations. # this is not using the GAMBLR functions but shows an example of how unsupported queries can be accomplished. Beware queries that will return many thousands of variants. Thes will be slow and may fail if they're too greedy
# note that here I'm not restricting to only coding variants
example_dlbcl = hg19_maf %>% filter(Tumor_Sample_Barcode == "HTMCP-01-06-00422-01A-01D")
example_dlbcl_df = example_dlbcl %>% as.data.frame()
example_dlbcl_maf = read.maf(example_dlbcl_df)
#> -Validating
#> -Silent variants: 4552
#> -Summarizing
#> -Processing clinical data
#> --Missing clinical data
#> -Finished in 0.109s elapsed (0.363s cpu)
rainfallPlot(example_dlbcl_maf)
#> Processing HTMCP-01-06-00422-01A-01D..What if you want to view the copy number landscape of a single patient and the VAF of somatic mutations in different copy number states? This can sometimes help you scrutinize whether the copy number fit looks reasonable.
#use the same sample as the previous example
my_sample = "HTMCP-01-06-00422-01A-01D"
copy_number_vaf_plot(this_sample=my_sample)
#> [1] "copy number colours"
#> Warning: Removed 52 rows containing missing values (geom_point).
#what if we want to focus on putative driver mutations? You can restrict this plot just to coding mutations and label genes of your choice.
#use the built in lymphoma gene list and subset for BL or DLBCL
my_genes=lymphoma_genes %>% filter(BL==TRUE | DLBCL == TRUE) %>% pull(Gene)
copy_number_vaf_plot(this_sample=my_sample,coding_only = TRUE,genes_to_label = my_genes)
#> [1] "copy number colours"
#> Warning: Removed 2 rows containing missing values (geom_point).You may also be interested in visualizing all mutations across a given cohort (or every genome). This can only be done in relatively small regions. Here’s a way to visualize the non-coding mutation pattern across a region of interest.
# set up some coordinates to annotate in your plot (optional)
mybed = data.frame(start=c(128806578,128805652,128748315), end=c(128806992,128809822,128748880), name=c("TSS","enhancer","MYC-e1"))
# get the mutations within a region of interest
# note that we can specify the query chromosome with or without a chr prefix and it will be handled elegantly
my_mutations = get_ssm_by_region(chromosome="chr8",qstart=128743606,qend=128820015)
ashm_rainbow_plot(mutations_maf=my_mutations,metadata=my_metadata,bed=mybed)A more advanced and fairly automatic plotting function for exploring multiple aSHM regions at once.
# Handy function that provides a vector of colours for giving points for different pathology/subgroups reproducible and distinguishable colours
lymphgen_colours = get_gambl_colours(classification="lymphgen")
# This package comes with some custom (curated) data such as the regions recurrently affected by hypermutation in B-NHLs
ashm_multi_rainbow_plot(regions_to_display=c("BCL2-TSS","MYC-TSS","SGK1-TSS","IGL"),custom_colours = lymphgen_colours)
#> Joining, by = "sample_id"
#> Joining, by = "start"
#> Joining, by = "sample_id"